Improving prediction of blood cancer using leukemia microarray gene data and Chi2 features with weighted convolutional neural network

Blood cancer has emerged as a growing concern over the past decade, necessitating early diagnosis for timely and effective treatment. The present diagnostic method, which involves a battery of tests and medical experts, is costly and time-consuming. For this reason, it is crucial to establish an automated diagnostic system for accurate predictions. A particular field of focus in medical research is the use of machine learning and leukemia microarray gene data for blood cancer diagnosis. Even with a great deal of research, more improvements are needed to reach the appropriate levels of accuracy and efficacy. This work presents a supervised machine-learning algorithm for blood cancer prediction. It makes use of the 22,283-gene leukemia microarray gene data. Chi-squared (Chi2) feature selection and synthetic minority oversampling technique (SMOTE)-Tomek resampling are used to overcome issues with imbalanced and high-dimensional datasets. To balance the dataset for each target class, SMOTE-Tomek creates synthetic data, and Chi2 chooses the most important of the 22,283 genes as features to train the learning models. A novel weighted convolutional neural network (CNN) model is proposed for classification, utilizing the support of three separate CNN models. To determine the importance of the proposed approach, extensive experiments are carried out on the datasets, including a performance comparison with the most advanced techniques. Weighted CNN demonstrates superior performance over other models when coupled with SMOTE-Tomek and Chi2 techniques, achieving 99.9% accuracy. Results from k-fold cross-validation further affirm the superiority of the proposed model.

• To enhance the predictive accuracy of blood cancer detection, a novel hybrid model, weighted CNN, is introduced. The weighted CNN combines three CNN models through a majority voting mechanism. The impact of the synthetic minority oversampling technique (SMOTE)-Tomek on data balancing is explored. Chi-square (Chi2) is used to determine the best set of features for classification, and SMOTE-Tomek is used to address the class-imbalance problem through upsampling.
• The performance of well-known machine learning methods on microarray gene data is evaluated in this work. These techniques include decision trees (DT), AdaBoost classifier (ADA), k-nearest neighbor (KNN), random forest (RF), logistic regression (LR), support vector classifier (SVC), and Naive Bayes (NB).
• The effectiveness of the proposed approach is thoroughly examined through extensive experiments, and a comparative analysis with various state-of-the-art methods is conducted. K-fold cross-validation is used to further substantiate the results and validate the robustness of the proposed approach.
The rest of the paper follows a structured format. "Related work" delves into the latest advancements in DNA microarray techniques, providing a detailed examination of specific methodologies. Moving on to "Material and methods", we present the proposed approach and elucidate the microarray datasets utilized in our experiments. "Experiments and analysis" outlines the outcomes of these experiments. The study wraps up in "Conclusion", where we provide a summary of the findings and explore possible directions for further research.

Related work
Using microarray data to classify cancer is a common research area in machine learning. Researchers propose and evaluate various methods and models for this task. Karim et al. 13 proposed an automated system for leukemia cancer classification using microarray gene data. They used machine learning and ensemble learning models for the prediction of leukemia. To handle the class imbalance problem, they used the SMOTE upsampling method. The result of the study shows that the proposed LDSVM achieved an accuracy score of 99.8%. Similar to our work, another study 14 examined the diagnostic performance of a deep learning approach using the Leukemia_GSE9476 dataset in comparison to conventional methods. The dataset had gene data from leukemia microarrays, including 22,283 genes. They applied normalization tests at the beginning and used a deep neural network (DNN) for training and testing. The results showed that the accuracy of the deep learning network was 0.96, while the accuracy of the standard technique was 0.63. The Leukemia_GSE28497 dataset was utilized in a different study 15 to assess the interoperability of several microarray and ribonucleic acid (RNA)-seq systems. In their investigation, they examined four different kinds of leukemia samples. To choose the best features, they used a method called minimum redundancy maximum relevance (mRMR). Findings indicate that a small set of ten genes can yield an accuracy of 97.29%. To confirm the model's effectiveness for multi-class classification, the statistical test known as analysis of variance (ANOVA) is run. For multi-class classification, Fauzi et al. 16 proposed an automated system that uses the Fuzzy Support Vector Machine (FSVM) with PCA as the feature selection technique. They performed the experiments in two scenarios. In the first experiment, they used the FSVM on the original features, and in the second experiment, they used the FSVM on the PCA features. The result of the study shows that the FSVM performed better on the PCA features, achieving an accuracy score of 96.92%, whereas on the original features it achieved an accuracy score of 87.69%.
Abd El-Nasser et al. 17 developed an Enhanced Classification Algorithm (ECA) that combines the Select Most Informative Genes (SMIG) module with standardization, achieving 98% accuracy in just 0.1 seconds when preprocessing and classification are performed. This approach outperformed previous methods. In a separate study, Sanaz Mehrabani et al. 18 employed machine learning models and sparsity-based gene selection techniques to detect blood cancer using leukemia gene expression data. They employed two machine learning models, RF and SVM, for the prediction of blood cancer. Results of the study reveal that the RF achieved the highest accuracy on the ten genes selected by norm and norm sparsity-based gene selection and SVM achieved the highest accuracy on the norm method. Mahdi et al. 19 proposed a fusion of the stochastic gradient descent approach with supervised principal component analysis (SPCA) for leukemia prognosis. This system can handle the large dimensional space efficiently. The results reveal that SGD-SPCA stands out as a highly efficient method, achieving 99.1% accuracy on the training data. In a separate study, Loey et al. 20 developed classification models to distinguish between blood microscopic images of individuals with and without leukemia. They leveraged a pre-trained Convolutional Neural Network (CNN) called AlexNet for feature extraction, combined with various classifiers. The results show that Support Vector Machine (SVM) outperformed other classifiers in terms of performance. Moreover, AlexNet demonstrated its superiority over other models across different metrics when used for both feature extraction and classification.
In reference to cancer prediction, a multi-model ensemble is introduced by Yawen Xiao et al. 21. This work examined gene data that were taken from lung, breast, and stomach tissues. To prevent overfitting in classification, the authors employed the DESeq approach, which effectively pinpointed genetic distinctions between normal and tumor phenotypes. Additionally, this approach managed data dimensionality, leading to improved forecast accuracy and a noteworthy reduction in computational time. They achieved the highest accuracy of 98.78% using a deep learning ensemble model. Ancona et al. 22 explore gene tumors related to colon cancer in their study. Their results highlight that the voting categories yield optimal outcomes when applied to an individual classifier. Cancer classification involves constructing models on datasets, specifically those containing microarray gene expression information. The process involves discerning the class value for each case in the sample by creating the model. Diagnostic findings based on this approach may empower doctors to implement a suitable therapy protocol, especially during the early stages of diagnosis and treatment.
Despite the impressive outcomes detailed in the aforementioned studies, the utilization of microarray gene data for predicting blood cancer remains relatively underexplored. Furthermore, aside from a few research endeavors, the reported accuracy in the remaining studies is insufficient for effective blood cancer prediction. The complete summary of the related work is shown in Table 1.

Material and methods
This section contains details on the dataset, techniques, and methods applied to blood cancer prediction. A complete step-by-step diagram of the proposed methodology is shown in Fig. 1.

Description of dataset
This study uses the Leukemia_GSE28497 dataset 23. There are seven classes, 22,283 genes (features), and 281 samples in total. The class-wise count of the dataset is given in Table 2.
In Table 3, an example of the original dataset is provided. It shows the real data that was obtained by applying the microarray gene technique. Blood cancer types are described by attributes, while the properties of the genes that are utilized to distinguish affected individuals from unaffected individuals are shown in another column. A microarray test is used to determine the values of these features. Every instance is described in a single row and contains information about the blood sample monitored by the microarray.
www.nature.com/scientificreports/

Data preprocessing
The dataset was made available via the National Centre for Biotechnology Information (NCBI) website. Subsequently, preprocessing steps are implemented to enhance the effectiveness of learning models 24. This includes data resampling to balance the dataset using the SMOTE-Tomek technique, which generates data for minority classes, as detailed in Table 4. The Chi-square method is then used for feature selection in order to reduce complexity. The top 300 gene features for the best learning model fitting are given in Table 5.
The selection of features is guided by empirical findings. The preprocessed dataset is then divided in an 85:15 ratio into training and testing sets, as given in Table 6.
With 85% dedicated to training, the models receive ample data for effective learning, given the dataset's overall size. Post-training, the remaining 15% is utilized to assess the model's performance using measures like F1 score, recall, accuracy, and precision. This preprocessing ensures the dataset is well-prepared for training and evaluation, enhancing the learning models' performance.
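As an illustration of the 85:15 division, the following is a minimal sketch in plain Python. The paper does not state whether the split is stratified; stratifying by class, the function name `stratified_split`, and the fixed seed are assumptions made here so that every class appears in both subsets.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.15, seed=0):
    """Split sample indices into train/test, keeping class proportions.

    Stratification is an assumption of this sketch, not a documented
    detail of the study.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out roughly 15% of each class, and at least one sample.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (made up): two balanced classes of 100 samples each.
toy_labels = ["a"] * 100 + ["b"] * 100
train_idx, test_idx = stratified_split(toy_labels)
```

With balanced classes, as produced by resampling, each class contributes the same number of held-out samples.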

SMOTE-Tomek
SMOTE-Tomek is a technique that addresses the issue of class overlap by combining SMOTE and Tomek Links. SMOTE generates synthetic instances for the minority classes by interpolating between existing minority instances. Tomek Links then identifies pairs of instances from different classes that are each other's nearest neighbors 25: a pair forms a Tomek link when no other instance lies closer to either member than the two members are to each other, and instances forming such links are eliminated. In essence, Tomek Links can eliminate instances situated at the boundaries of multiple classes, considering the adjacencies between instances near these boundaries. It targets instances considered to belong to multiple classes due to SMOTE, thereby eliminating those responsible for overlap. However, in this scenario, the instance being removed is not likely to be adjacent to its proper class, potentially increasing the chance of introducing unwanted noise.
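The Tomek-link criterion can be sketched in a few lines of plain Python. This is an illustrative brute-force version under the usual definition (mutual nearest neighbors with different labels), not the implementation used in the experiments, which rely on imblearn; the function name and toy points are made up.

```python
import math

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links."""
    def nearest(i):
        # Index of the closest other point under Euclidean distance.
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))

    links = []
    for i in range(len(points)):
        j = nearest(i)
        # Mutual nearest neighbors with different labels form a link.
        if i < j and nearest(j) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

# Toy 2-D data (made up): points 2 and 3 sit on the class boundary.
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 1.0), (2.0, 2.0), (2.1, 2.2)]
labels = [0, 0, 0, 1, 1, 1]
links = tomek_links(points, labels)
```

Only the boundary pair is flagged; the other mutual nearest neighbors share a label and are left untouched.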

Chi-square
Chi-square is a non-parametric statistical method used for feature selection, specifically for selecting the top n features. It is widely employed in data analysis tasks. Chi2 assesses the independence between a given term and the presence of a specific class. In a given document D, the score for each term is estimated and ranked using Eq. (1):

\[ \chi^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}} \tag{1} \]

In this context, N represents the observed frequency and E denotes the expected frequency. The variable e_t is assigned a value of 1 when the term t is present and 0 otherwise, and e_c is assigned a value of 1 if the document is part of class c and 0 otherwise. In the present setting, each gene feature plays the role of a term and each sample the role of a document. A significant Chi2 score for a feature indicates that the null hypothesis H0, which asserts independence (i.e., that the document class does not influence the term's frequency), should be rejected. This implies that there is interdependence between the feature and the class. Consequently, in such instances, it is advisable to select the microarray gene feature for model training.
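To make the Chi2 score concrete, the sketch below computes it for a binary feature indicator against a binary class indicator, summing over the four cells of the 2×2 contingency table. It is a toy version under that binary assumption; for continuous expression values a library routine such as `sklearn.feature_selection.chi2` would be the more usual choice. The function name and data are illustrative.

```python
def chi2_score(feature_present, in_class):
    """Chi-square statistic for a binary feature vs. a binary class indicator."""
    n = len(feature_present)
    score = 0.0
    for e_t in (0, 1):
        for e_c in (0, 1):
            # Observed count for this (feature, class) cell.
            observed = sum(1 for f, c in zip(feature_present, in_class)
                           if f == e_t and c == e_c)
            # Expected count under independence: row total * column total / n.
            row_total = sum(1 for f in feature_present if f == e_t)
            col_total = sum(1 for c in in_class if c == e_c)
            expected = row_total * col_total / n
            if expected > 0:
                score += (observed - expected) ** 2 / expected
    return score

# Toy indicators (made up): one feature tracks the class, the other does not.
class_indicator = [1, 1, 0, 0]
dependent_score = chi2_score([1, 1, 0, 0], class_indicator)
independent_score = chi2_score([1, 0, 1, 0], class_indicator)
```

Ranking features by this score and keeping the highest-scoring ones corresponds to the top-n selection described above.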

Supervised machine learning models
This study uses a variety of machine learning models, such as KNN, RF, LR, ETC, SVC, ADA, NB, DT, and the proposed approach, WVCNN, to predict blood cancer. Table 7 provides implementation details about these machine learning models and their hyperparameter settings. To find the best parameters, a method called grid search is used. This involves trying different values for each parameter within a specified range and evaluating how well the model works. Every parameter goes through the procedure, and the values that optimize the model's performance are selected at the conclusion.
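The grid search procedure can be sketched generically as follows. The scoring function `toy_score` is a hypothetical stand-in for "train a model with these parameters and return its validation accuracy", and the parameter names and ranges are illustrative rather than those of Table 7.

```python
from itertools import product

def grid_search(score_fn, param_grid):
    """Evaluate every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical objective that peaks at n_estimators=300, max_depth=20.
def toy_score(n_estimators, max_depth):
    return 1.0 - abs(n_estimators - 300) / 1000 - abs(max_depth - 20) / 100

grid = {"n_estimators": [100, 200, 300], "max_depth": [10, 20, 30]}
best_params, best_score = grid_search(toy_score, grid)
```

The cost grows with the product of the range sizes, which is why the ranges are usually kept small.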

Random forest
RF is an ensemble model that achieves high predictive accuracy by combining the results of many decision trees 26. This model uses a technique called bagging, where it trains various decision trees on different samples of the data. Each sample is drawn from the training data with replacement, which means some data might be repeated. The size of this sample is equal to the size of the original training data. When making predictions, RF and other ensemble classifiers follow similar processes for creating decision-making groups. A key difficulty in developing these models is deciding the attribute for the root node at each level, determined using Eq. (2).

Logistic regression
LR is a statistical method that processes information using one or more variables to find a solution 27. It is specifically used for predicting probabilities of class membership when dealing with categorical target variables. LR uses a logistic function to assess the likelihood of a relationship between the dependent variable and one or more independent variables. It is considered the most suitable learning model when dealing with categorical target variables. Equation (3) shows the logistic function used by LR.
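For illustration, the standard logistic (sigmoid) function maps a linear combination of the inputs to a probability between 0 and 1. The sketch below assumes this standard form, 1 / (1 + e^(-z)); the exact parameterization in the original Eq. (3) is not recoverable here, and the weights are made up.

```python
import math

def logistic(z):
    """Standard logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of the positive class for feature vector x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return logistic(z)

# Made-up weights: the two features cancel out, giving z = 0.
p_neutral = predict_proba([1.0, -1.0], 0.0, [2.0, 2.0])
p_confident = predict_proba([1.0, 1.0], 0.0, [3.0, 3.0])
```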

Support vector classifier
Classification involves organizing a dataset into categories using specific criteria to provide a more meaningful grouping 28. The classification approach known as SVC is based on the support vector methodology. SVC's primary goal is to identify the best-fitting hyperplane that effectively separates the given data. Once the hyperplane has been constructed, features can be fed to the classifier to obtain the predicted class. This algorithm is well-suited for various applications, including ours, as it can be employed in different contexts.

K-Nearest neighbors
KNN is a fundamental machine learning model that is used for problems involving both regression and classification 29,30. This model assigns a data point to a class based on its closest neighbors, determining proximity through a distance measure. In this experiment, the effectiveness of KNN is demonstrated, particularly when k is set to five (k = 5). This signifies that the model considers the five nearest neighbors and selects the class held by the majority of them.
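A minimal sketch of the k = 5 decision rule follows, assuming Euclidean distance and a simple majority vote; the distance measure actually used in the experiments is not stated, and the toy points and labels are made up.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=5):
    """Predict the majority label among the k nearest training points."""
    # Sort training indices by Euclidean distance to the query.
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Two made-up clusters of four points each.
train_x = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
train_y = ["A", "A", "A", "A", "B", "B", "B", "B"]
near_a = knn_predict(train_x, train_y, (0.5, 0.5))
near_b = knn_predict(train_x, train_y, (5.5, 5.5))
```

With k = 5, four of the five neighbors come from the query's own cluster, so one stray neighbor from the other class does not change the vote.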

Naive Bayes
The NB algorithm is a classification algorithm based on the Bayes theorem 31. It is a supervised learning algorithm known for its efficiency and scalability in training with a restricted set of information. As a probabilistic classifier, NB predicts the likelihood that an item belongs to a specific class. One key assumption of the NB classifier is that each feature's likelihood is independent of the others, and they do not overlap. This suggests that every attribute has an equal role in identifying whether a sample is a member of a certain class.
The NB classifier is simple to use, computes rapidly, and performs well on large, highly dimensional datasets.

Extra trees classifier
The ETC operates similarly to the random forest, with the main difference lying in the tree-building process 32. Every decision tree in ETC is built from the original training sample. At each node, a random sample of k features is drawn, and the Gini index is used to find the best way to split the data among them. These random feature samples result in the creation of multiple decision trees that are not highly correlated. The decision tree algorithm is versatile, suitable for both categorical and numerical data, and it performs effectively in various scenarios.

Decision tree
The DT is a widely used and potent data-mining technique that has been extensively developed and tested by numerous researchers 33. Despite its effectiveness, it is essential to acknowledge the presence of data errors during the learning process. Consequently, working with substantial volumes of data is crucial to creating a decision tree algorithm capable of producing a straightforward tree structure with high classification accuracy. For this study, a dataset from Kaggle was chosen to implement the decision tree algorithms. The strategic division of data significantly impacts the accuracy of the tree, with different decision criteria utilized for classification and regression tasks.
The entropy is expressed mathematically for one attribute by Eq. (4). Entropy is expressed mathematically for multiple attributes by Eq. (5). Information gain (IG) is defined mathematically by Eq. (6).
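As a sketch of these quantities, the code below uses the standard textbook definitions, which are assumed here to match Eqs. (4) to (6): entropy of a label set is the sum of -p·log2(p) over class proportions p, the conditional entropy for an attribute is the size-weighted average entropy of the groups it induces, and information gain is the difference between the two. Function names and toy data are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Entropy reduction obtained by splitting on an attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the groups induced by the attribute.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Made-up labels and two candidate attributes.
toy_labels = ["a", "a", "b", "b"]
gain_informative = information_gain(toy_labels, ["x", "x", "y", "y"])
gain_useless = information_gain(toy_labels, ["x", "y", "x", "y"])
```

A split that separates the classes perfectly recovers the full entropy of the labels as gain, while a split unrelated to the classes gains nothing.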

AdaBoost
This ensemble learning model uses a method called boosting to train weak learners, such as decision trees. ADA, short for adaptive boosting, is historically important 34: it was the first boosting algorithm that could adapt to the weak learners. The ADA method brings together weak learners, training them one after the other on reweighted copies of the original dataset. Each subsequent weak learner concentrates on difficult data points or outliers. It works as an overall model, using N rounds of weak learners trained on the same set of features, each given a different weight. ADA has a strong theoretical foundation. Many experiments have shown that ADA often performs better than other learning methods.
The study used the ADA method with different hyperparameter values, tuned to obtain high accuracy. The 'n_estimators' value was set to 300, meaning that ADA used 300 weak learners to make predictions. It is important to point out the difference between RF and ADA: RF uses a method called bagging, while ADA uses a boosting strategy. Boosting should be understood as combining weak learners into one strong learner by weighting them in a specific way. 'random_state' is another parameter used; it controls the randomness of the sampling during model training.

Artificial neural networks
Artificial neural networks (ANNs) are computational models inspired by the human brain's neural networks, designed to recognize patterns and solve complex problems 35. ANNs consist of interconnected layers of neurons, including input, hidden, and output layers, where each neuron processes input data and passes it through an activation function. These models are widely used in various applications such as image and speech recognition, natural language processing, and predictive analytics due to their ability to learn and generalize from data. Training an ANN involves adjusting the weights of the connections between neurons using algorithms like backpropagation to minimize error. Despite their power, ANNs can require large amounts of data and computational resources to achieve high performance.

Multi-layer perceptron
Multi-layer perceptron (MLP) is a type of artificial neural network that consists of multiple layers of neurons, typically including an input layer, one or more hidden layers, and an output layer 30. Each neuron in an MLP uses a nonlinear activation function to process input data, allowing the network to capture complex patterns and relationships in the data. MLPs are particularly effective for supervised learning tasks such as classification and regression. The backpropagation algorithm is commonly used to train MLPs by adjusting the weights of the connections to minimize the error between the predicted and actual outputs. MLPs have been successfully applied in various fields, including image and speech recognition, finance, and healthcare.

Proposed methodology
The architecture of the proposed technique using the weighted CNN for blood cancer prediction is depicted in Fig. 2. The relevant details of how it works are provided in the subsequent sections.

Weighted voting ensemble
A weighted voting ensemble (W.V.E.) of CNN models is essential for improving robustness and prediction accuracy 36. Using ensemble learning techniques is very beneficial in prediction, where accuracy and reliability are critical. This is because the approaches combine the benefits of numerous CNN models, each of which has a unique advantage in capturing different aspects of the data. In addition to improving accuracy, this variety strengthens predictions against noise and outliers by assigning lower weights to the models that are susceptible to them. These ensembles also aid in promoting improved generalization, mitigating overfitting, lowering prediction variance, and balancing biases within individual models. They are a flexible tool for modeling intricate decision boundaries and adjusting to shifting data patterns because of their versatility and capacity to fine-tune model weights 37. To sum up, weighted voting ensemble CNN models are a useful tool for achieving more precise and reliable predictions in a variety of applications. In this paper, an ensemble model, WV-CNN, is proposed, which is derived from three 1-D CNN models to attain high accuracy and leverage the robust decision-making strength of multiple models for enhanced cancer prediction. The following lists the model structures of the three contributing CNN models 38. This dense layer has six neurons.

ECN-2 (ensemble convolutional network 2)
To get class probability values, the second Sequential convolution model, ECN-2, is fed the feature vector v'. Table 8 details the model design, which includes a 1-D convolutional layer with Fn and Ks both set to 3. A flatten layer, a dense layer with eight neurons, and a dropout layer with a rate of 0.3, which helps prevent overfitting, are added next. The final layer is dense, with as many neurons as there are classes.

ECN-3 (ensemble convolutional network 3)
To get classification scores, the proposed Sequential network ECN-3 receives the input feature vector v', just like the first two models. The model's layer structure consists of a 1-D convolutional layer with Fn = 5 and Ks = 3, with ReLU serving as the activation function, as shown in Table 8. The next set of layers consists of a flatten layer and a dense output layer that uses a softmax classifier and has as many neurons as there are classes to be predicted. Table 9 provides the parametric setting for every model in the ensemble. Adam is utilized as the optimizer for all models in the ensemble because of its faster computation and simpler parameter configuration. The gradient's scaling term ( β 2) and momentum ( β ) are set at 0.999

Weighted voting convolutional neural network
The improved prediction values are obtained by an ensemble voting scheme that uses the output probabilities of the three proposed models, ECN1, ECN2, and ECN3. Since each model's contribution to the final probability depends on a set weight value, the combined class probabilities represent the outcome of the vote from each model 37,39. The weights W1, W2, and W3 for the models ECN1, ECN2, and ECN3, as shown in Fig. 2, are set at 0.4, 0.3, and 0.3, respectively. The final probability vectors of the models ECN1, ECN2, and ECN3 for a given input feature vector V are represented by the symbols p1, p2, and p3. The mathematical expression for the weighted average of the probability over all three models is given in Eq. (7):

\[ P = W_1 p_1 + W_2 p_2 + W_3 p_3 \tag{7} \]

Because the weights sum to 1, no further normalization is required. Choosing the class with the greatest probability is done by Eq. (8):

\[ \hat{y} = \arg\max_{c} P_c \tag{8} \]
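The voting step can be sketched as follows, using the stated weights 0.4, 0.3, and 0.3. The probability vectors are made up and use three classes for brevity rather than the seven in the dataset; the function name is illustrative.

```python
def weighted_vote(prob_vectors, weights):
    """Combine per-model class probabilities and pick the top class."""
    n_classes = len(prob_vectors[0])
    # Weighted average of the class probabilities across models.
    combined = [sum(w * p[c] for w, p in zip(weights, prob_vectors))
                for c in range(n_classes)]
    # Index of the class with the highest combined probability.
    predicted = max(range(n_classes), key=lambda c: combined[c])
    return combined, predicted

# Made-up softmax outputs of the three models for one sample.
p1 = [0.70, 0.20, 0.10]
p2 = [0.30, 0.60, 0.10]
p3 = [0.20, 0.70, 0.10]
combined, predicted = weighted_vote([p1, p2, p3], [0.4, 0.3, 0.3])
```

In this toy case the highest-weighted model favors class 0, but the other two models together carry more weight and agree on class 1, so class 1 is selected.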

Evaluation parameter
Performance measures for the machine learning models are evaluated, including recall, precision, F1 score, and accuracy. All these metrics were used in this study for the evaluation of the machine learning models based on the confusion matrix. This matrix is a tabular tool that shows how well the model performs in classifying test data.

Accuracy
The accuracy score shows how closely the model's predictions match the actual results. It serves as a measure that quantifies the model's capability to make correct predictions. The accuracy score is computed by dividing the number of correct predictions by the total number of predictions. An ideal model achieves a perfect accuracy score of 1, while the lowest possible score is 0. It is defined in Eq. (9):

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9} \]

where
• True positive (TP): when the model correctly predicts a patient to be 'Healthy' and the patient's real label is 'Healthy'. This shows that the patient's actual condition and the model's prediction match.
• True negative (TN): when the model correctly predicts a patient to be 'Un-Healthy' and the actual label is 'Un-Healthy'. Similar to TP, TN signifies that the model's prediction aligns with the patient's actual condition.
• False positive (FP): when the model incorrectly predicts a patient as 'Healthy' while the true label is 'Un-Healthy'. This represents a scenario where the model fails to identify an infected patient, leading to a false sense of normalcy.
• False negative (FN): when the model erroneously predicts a patient as 'Un-Healthy' when the true label is 'Healthy'. FN indicates that the model misclassifies a healthy patient as infected, resulting in unnecessary concern or treatment.

Precision
Precision is the number of correctly predicted positive cases relative to all cases predicted as positive. The best score a model can get for precision is 1 and the lowest is 0. Precision is calculated using Eq. (10):

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{10} \]

Recall
Equation (11) is used to calculate the recall:

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{11} \]
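The measures can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions, including the F1 score (the harmonic mean of precision and recall) covered in the next subsection; the counts are made up and the function name is illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts for a binary problem with 100 test samples.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For the seven-class problem in this study, such counts would be taken per class (one class versus the rest) and then averaged.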

F1 score
The balance between recall and precision is represented by the F1 score, also known as the F measure. It reflects the trade-off between these two metrics, with values ranging from 0 to 1. F1 score is calculated using Eq. (12):

\[ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{12} \]

Experiments and analysis
In this research, we use various machine learning models to identify blood cancer, in addition to the proposed weighted CNN approach. To guard against overfitting, the dataset is split into training and testing sets using the 85:15 ratio, which is often used in classification studies. We employ a range of metrics intended for machine learning classifiers to assess the models' performance. The operating system is Microsoft Windows 10 and all experiments are conducted in a Python environment, employing different libraries: sklearn to implement machine learning models, Keras to implement deep learning models, seaborn for visualization, and imblearn for SMOTE sampling. Two Intel Xeon 8-core processors operating at 2.4 GHz, along with 32 GB of DDR4 RAM, power the Dell PowerEdge T430 GPU server that performs the computations.

Experimental results with the original dataset
Table 10 presents a comprehensive summary of the findings from the initial experiment on machine learning performance using the original blood cancer dataset. In this dataset, the LR, KNN, and SVC models exhibit higher values of accuracy, with the highest accuracy of 91%. Notably, the proposed WVCNN model stands out for its performance in terms of recall, precision, accuracy, and F1 score, which are 92%, 96%, 90%, and 90%, respectively, on the original dataset. This shows the significance of the proposed approach in predicting blood cancer. The performance of WVCNN can be attributed to its hybrid architecture. LR, SVC, and KNN also demonstrate commendable performance in comparison with the accuracy of WVCNN. On the other hand, ADA performs worst among the learning models used in this study. This is due to the limited size of the dataset, which impedes its boosting approach that relies on a larger number of records for enhanced accuracy. In confusion matrices, B-CELL ALL, B-CELL ALL ETV6-RUNX1, B-CELL ALL HYPERDIP, B-CELL ALL HYPO, B-CELL ALL MLL, B-CELL ALL T-ALL, and B-CELL ALL TCF3-PBX1 are represented by the numbers 0, 1, (

Performance of models following implementation of Chi2 technique
The Chi2 method is utilized to evaluate the performance of both the learning models and the proposed system. As indicated by the results in Table 12, Chi2 has a negligible effect on the performance of the models. In particular, WVCNN maintains its original accuracy score of 92%, consistent with the observations from the initial dataset.
In the case of the original dataset, the use of Chi2 enhances ADA's performance from 65% to 72% and DT's performance from 72% to 74%. It is important to note that the primary goal is not to surpass SMOTE-Tomek in terms of model performance. Chi2 serves to decrease complexity and enhance overall performance by choosing only significant features to train the model, resulting in improved performance compared to the original dataset.

Experimental results integrating Chi2 and SMOTE-Tomek
The models exhibit notably enhanced performance when the preprocessing involves a combination of the SMOTE-Tomek and Chi2 approaches. By employing SMOTE-Tomek, new data is generated to mitigate the risk of model overfitting for the majority class, while Chi2 selects the most pertinent features by assessing their correlation with the target class. As depicted in Table 13, the results demonstrate a substantial improvement when both approaches are applied simultaneously. All models perform well on average, but the proposed WVCNN model outperforms the others, achieving a near-perfect accuracy of 99.9%. In comparison, LR, ETC, and KNN demonstrate accuracy values of 97%, 97%, and 95%, respectively, while RF and SVC attain accuracy scores of 99%. This study independently performs feature selection on both the training and testing sets subsequent to data splitting. This approach is adopted to evaluate the significance of WVCNN and to preempt any potential data leakage that might arise if feature selection were to occur prior to the data split. The experimental data for each model is detailed in Table 14, revealing that the proposed model, WVCNN, outperforms all others with the highest accuracy score of 98%, establishing its prominence among all evaluation factors. It is closely followed by the LR, RF, and ETC models.
Table 15 shows a comparison of the proposed model's performance involving the original dataset, feature engineering with Chi2, data oversampling using SMOTE-Tomek, and integrating both Chi2 and SMOTE-Tomek. Results show superior performance when Chi2 is used for feature selection on the dataset oversampled using SMOTE-Tomek. These noteworthy findings imply that feature selection can be executed either before or after data splitting without causing data leakage. It is pertinent to highlight that the reported results were attained when feature selection was performed after the data-splitting process.

Results using K-fold cross-validation
To assess the reliability of the models, we employed K-fold cross-validation. The outcomes of this five-fold cross-validation are detailed in Table 16. The table shows that our proposed approach outperforms the other models.
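The five-fold procedure can be sketched as follows. This is an illustrative outline under simplifying assumptions: folds are contiguous and unshuffled, whereas practical setups usually shuffle or stratify. The `fit_majority` and `predict_majority` functions are a hypothetical stand-in classifier, not the WVCNN model.

```python
import statistics


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(X, y, k, fit, predict):
    """Train on k-1 folds, test on the held-out fold, and report mean and std of accuracy."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return statistics.mean(scores), statistics.pstdev(scores)


# Hypothetical stand-in classifier: always predicts the most common training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, X_test):
    return [model] * len(X_test)


X = [[float(i)] for i in range(10)]
y = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
mean_acc, std_acc = cross_validate(X, y, k=5, fit=fit_majority, predict=predict_majority)
```

The standard deviation across folds is the quantity that indicates how stable a model's score is from one split to another.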

Performance of weighted CNN concerning existing studies
The performance of the proposed approach is compared with several state-of-the-art methods to confirm its efficacy; the outcomes are shown in Table 17. These studies were chosen because their experiments were conducted on the same dataset. For example, Castillo et al. 15 examined a range of machine learning models, with RF achieving an accuracy of 97.28%. Similarly, Nazari et al. 14 predicted blood cancer with 96.6% accuracy using a deep learning technique. In comparison, our method achieved 99.9% accuracy on the same dataset, indicating the value of the proposed method.

Conclusion
The main goal of this study is to predict leukemia, a type of blood cancer, from an imbalanced dataset. Despite the abundance of known methods, achieving high accuracy on imbalanced and high-dimensional datasets remains difficult. In response, a hybrid weighted CNN learning model is presented that predicts cancer from leukemia microarray gene data. The SMOTE-Tomek resampling technique is used to address the dataset's imbalance, and Chi-square is employed to remove features that show minimal association with the target class, thereby mitigating the high-dimensionality problem. Extensive experiments are carried out with the proposed hybrid approach that incorporates both methods, and with SMOTE-Tomek and Chi-square applied independently. SMOTE-Tomek improves the performance of the proposed weighted CNN model as well as the machine learning models: by balancing the classes, it lowers the risk of overfitting. Chi-square, on the other hand, has only a minor effect on prediction accuracy. Although it reduces complexity by training models on only the most pertinent features, not all models perform better with it. With SMOTE-Tomek and Chi-square combined, weighted CNN performs best and achieves 99.9% accuracy in cancer prediction. Using the selected features, the hybrid architecture of weighted CNN achieves a good fit and generates predictions based on a majority voting criterion. Furthermore, a comparison with state-of-the-art methods supports the validity and superiority of the proposed method. The results imply that an ensemble of models can perform better than a single model. Future work will develop a customized deep learning model suited to small datasets. Additionally, integrating several datasets to produce a complex, high-dimensional dataset is being considered for studies using the proposed approach.
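The majority voting criterion mentioned above can be illustrated with a short sketch. It assumes that each of the three CNN models outputs one class label per sample. The class labels, the optional `weights` argument, and the tie-breaking rule (the earliest model in the list wins) are hypothetical choices made for illustration, since the text does not specify them.

```python
from collections import Counter


def majority_vote(predictions_per_model, weights=None):
    """Combine per-model class labels into one label per sample.

    `weights` is a hypothetical option; by default every model has one equal vote.
    """
    if weights is None:
        weights = [1.0] * len(predictions_per_model)
    combined = []
    for sample_votes in zip(*predictions_per_model):
        tally = Counter()
        for label, weight in zip(sample_votes, weights):
            tally[label] += weight
        best = max(tally.values())
        # Tie-break (assumption): take the label of the earliest model that reaches the top tally.
        winner = next(label for label in sample_votes if tally[label] == best)
        combined.append(winner)
    return combined


# Hypothetical predictions of three CNN models for four samples.
cnn_a = ["A", "B", "C", "A"]
cnn_b = ["A", "C", "C", "B"]
cnn_c = ["B", "C", "A", "D"]

ensemble = majority_vote([cnn_a, cnn_b, cnn_c])
weighted = majority_vote([cnn_a, cnn_b, cnn_c], weights=[1.0, 1.0, 3.0])
```

With equal votes, a single dissenting model is overruled; giving one model a larger weight lets it decide the cases where the other two agree with each other but not with it.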

Table 1. Summary of the related work.

Table 2. Class-wise count of the dataset.

Table 3. Sample from the dataset.

Table 4. Count of the target after SMOTE-Tomek.

Table 5. Count of features after the Chi2 feature selection technique.

Table 6. Training and testing ratio of all features.

Table 7. Hyperparameter tuning of all supervised learning models.

Table 9. Parameter configuration of the proposed WVCNN ensemble models.

Recall
Recall, also known as the true positive rate (TPR) or sensitivity, indicates how successfully the classifier recognizes the positive samples. It is calculated by dividing the number of true positives (TP) by the sum of TP and false negatives (FN): Recall = TP / (TP + FN). A model can have a maximum recall score of 1 and a minimum of 0.

Scientific Reports | (2024) 14:15625 | https://doi.org/10.1038/s41598-024-65315-7
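Recall is the fraction of actual positive samples that the classifier identifies, TP / (TP + FN). A small worked example is sketched below. The function name `recall_score_binary` and the label vectors are hypothetical, and returning 0.0 when there are no positive samples is a convention chosen here.

```python
def recall_score_binary(y_true, y_pred, positive):
    """Recall = TP / (TP + FN) for the class given by `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        return 0.0  # no positive samples: recall is undefined, reported as 0 here
    return tp / (tp + fn)


# Five samples are truly positive; four of them are recovered, one is missed.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1]
value = recall_score_binary(y_true, y_pred, positive=1)
```

Here TP = 4 and FN = 1, so the recall is 4 / 5 = 0.8. The false positive at the sixth sample does not affect recall; it would lower precision instead.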

Table 10. Learning model results on the original dataset.

The models perform better after applying the SMOTE-Tomek technique, which balances the dataset across all classes by generating new data. Balancing the data not only enlarges the dataset but also improves model performance while reducing the likelihood of overfitting. The results obtained with SMOTE-Tomek are outlined in Table 11. LR, RF, SVC, ETC, and WVCNN perform well, each attaining an accuracy score of 97%. Conversely, ADA performs poorly because the dataset is too small for the boosting algorithm to fit effectively.
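The two stages of SMOTE-Tomek can be sketched in a few lines: SMOTE creates synthetic minority samples by interpolating between a minority point and a minority neighbor, and Tomek-link removal then discards pairs of mutually nearest samples that carry different labels. The sketch below is a simplified illustration, not the implementation used in this study. It assumes a binary problem, uses only the single nearest neighbor (standard SMOTE draws from several), and removes both members of each Tomek link. All function names and the toy data are hypothetical.

```python
import math
import random


def nearest_neighbor(index, points):
    """Return the index of the closest other point (Euclidean distance)."""
    others = (j for j in range(len(points)) if j != index)
    return min(others, key=lambda j: math.dist(points[index], points[j]))


def smote(minority, n_new, rng):
    """Create n_new points on segments between a minority point and its nearest minority neighbor."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = nearest_neighbor(i, minority)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(minority[i], minority[j])])
    return synthetic


def tomek_links(X, y):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn) if i < j and nn[j] == i and y[i] != y[j]]


def smote_tomek(X, y, minority_label, rng):
    """Oversample the minority class to the majority count, then drop Tomek-link pairs."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = max(0, (len(X) - len(minority)) - len(minority))
    X_res = X + smote(minority, n_new, rng)
    y_res = y + [minority_label] * n_new
    drop = {i for pair in tomek_links(X_res, y_res) for i in pair}
    keep = [i for i in range(len(X_res)) if i not in drop]
    return [X_res[i] for i in keep], [y_res[i] for i in keep]


# Toy data (hypothetical): five majority samples near the origin, two minority samples far away.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [10.0, 10.0], [11.0, 10.0]]
y = [0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = smote_tomek(X, y, minority_label=1, rng=random.Random(0))
```

In this toy case the classes are well separated, so no Tomek links exist and the result simply contains five samples of each class. On overlapping classes the cleaning step would also remove borderline pairs.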

Table 11. Performance of the models using SMOTE-Tomek oversampling.

Table 12. Learning model results with the Chi2 technique.

Table 13. Learning model results with the integration of Chi2 and SMOTE-Tomek.

Table 14. Learning model results when feature selection is performed after data splitting.

Table 15. Comparison of learning model results employing different feature engineering techniques.

Across the folds, the proposed approach leads the other models in accuracy, precision, recall, and F1 score with minimal variability. This suggests that our approach is not only effective but also consistent.

Performance validation of weighted CNN
In this section, we validate the performance of the proposed model in two scenarios. In the first scenario, we tested the model on another independent cancer dataset from 2023 (ref. 40). This dataset contains attributes of patients diagnosed with cancer. It features a unique identifier for each patient, the type of cancer (diagnosis), and the average values of the measured characteristics. The dataset also includes several features for which patients are assigned numerical values, such as texture_mean, radius_mean, perimeter_mean, smoothness_mean, area_mean, compactness_mean, concave points_mean, and concavity_mean. On this dataset, the proposed model gives 98.94% accuracy, 98.98% precision, 99.25% recall, and a 99.15% F1 score. These values indicate that the proposed model remains accurate and stable on a different dataset.
In the second scenario, we trained the proposed model on the significant features obtained with Chi2 and SMOTE-Tomek and tested it on the original dataset features. The proposed model performs quite well compared with the results of Table 8, where the model was trained and tested on the original features of the dataset. It gives 96.52% accuracy, 98.15% precision, 96.37% recall, and a 97.67% F1 score.

Table 16. 5-fold cross-validation results for the proposed approach.

Table 17. Comparison of the proposed approach with other state-of-the-art systems.